========================================================
I chose to look at 2012 presidential campaign contributions for the state of Ohio. Ohio has long been a critical swing state in presidential elections and, according to Wikipedia, has the longest current streak among swing states of matching the overall election outcome (every election since its last miss in 1960). Campaign contributions aren’t necessarily the best (or even a strong) predictor of votes; however, they do give us some idea of voter sentiment. The ability to predict contribution amount could also be of use to presidential candidates on the campaign trail. This data set has contributions at the zipcode level, which, with the help of choropleth maps, will enable us to visualize spatial relationships.
## [1] "zipcode" "candidate" "name" "city" "state"
## [6] "employer" "occupation" "amount" "date" "gender"
## [11] "party" "population" "pcnt_wht" "pcnt_blk" "pcnt_asn"
## [16] "pcnt_hsp" "percap_incm" "med_rent" "med_age"
## 'data.frame': 151479 obs. of 19 variables:
## $ zipcode : Factor w/ 1037 levels "43001","43002",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ candidate : Factor w/ 14 levels "Bachmann, Michele",..: 7 8 7 12 12 12 12 7 12 7 ...
## $ name : chr "BAKER, NANCY" "WHITE, TIMOTHY CHRISTOPHER" "BRIGGS, DEAN" "CHAULK, SARAH" ...
## $ city : Factor w/ 1167 levels ":POLAND","`LOVELAND",..: 10 10 10 10 10 10 10 10 10 10 ...
## $ state : Factor w/ 1 level "OH": 1 1 1 1 1 1 1 1 1 1 ...
## $ employer : Factor w/ 13507 levels "","(SELF) GREEN LEAF LAWN CARE",..: 1006 1238 3163 10429 5653 5653 9833 1006 1662 1006 ...
## $ occupation : Factor w/ 6846 levels "","-","100% DISABLED VIETNAM VETERAN",..: 4148 4102 4649 2594 2975 2975 5237 4148 901 4148 ...
## $ amount : num 35 250 50 546 125 ...
## $ date : Date, format: "2012-06-25" "2011-05-26" ...
## $ gender : Factor w/ 2 levels "female","male": 1 2 2 1 2 2 2 1 2 1 ...
## $ party : Factor w/ 3 levels "democrat","green",..: 1 3 1 3 3 3 3 1 3 1 ...
## $ population : num 2295 2295 2295 2295 2295 ...
## $ pcnt_wht : num 93 93 93 93 93 93 93 93 93 93 ...
## $ pcnt_blk : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pcnt_asn : num 0 0 0 0 0 0 0 0 0 0 ...
## $ pcnt_hsp : num 3 3 3 3 3 3 3 3 3 3 ...
## $ percap_incm: num 34306 34306 34306 34306 34306 ...
## $ med_rent : num 592 592 592 592 592 592 592 592 592 592 ...
## $ med_age : num 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 46.2 ...
## zipcode candidate name
## 44122 : 2203 Obama, Barack :91286 Length:151479
## 43214 : 1745 Romney, Mitt :50672 Class :character
## 45208 : 1724 Paul, Ron : 4271 Mode :character
## 45243 : 1718 Santorum, Rick: 2012
## 43221 : 1694 Gingrich, Newt: 1432
## 44118 : 1566 Cain, Herman : 583
## (Other):140829 (Other) : 1223
## city state
## CINCINNATI: 18596 OH:151479
## COLUMBUS : 13820
## DAYTON : 5536
## CLEVELAND : 4748
## TOLEDO : 3276
## AKRON : 2965
## (Other) :102538
## employer
## RETIRED :34269
## SELF-EMPLOYED :11503
## NOT EMPLOYED : 8657
## INFORMATION REQUESTED PER BEST EFFORTS: 6049
## INFORMATION REQUESTED : 4217
## (Other) :86741
## NA's : 43
## occupation amount
## RETIRED :38151 Min. : 0.0
## INFORMATION REQUESTED PER BEST EFFORTS: 5778 1st Qu.: 25.0
## HOMEMAKER : 4674 Median : 50.0
## PHYSICIAN : 4458 Mean : 215.5
## ATTORNEY : 4056 3rd Qu.: 150.0
## (Other) :94351 Max. :15000.0
## NA's : 11
## date gender party population
## Min. :2011-01-28 female:67157 democrat :91286 Min. : 0
## 1st Qu.:2012-07-05 male :79580 green : 21 1st Qu.:16076
## Median :2012-09-17 NA's : 4742 republican:60172 Median :25049
## Mean :2012-08-05 Mean :26596
## 3rd Qu.:2012-10-17 3rd Qu.:35078
## Max. :2012-12-31 Max. :68475
## NA's :989
## pcnt_wht pcnt_blk pcnt_asn pcnt_hsp
## Min. : 0.00 Min. : 0.00 Min. : 0.000 Min. : 0.00
## 1st Qu.: 76.00 1st Qu.: 2.00 1st Qu.: 1.000 1st Qu.: 1.00
## Median : 87.00 Median : 4.00 Median : 2.000 Median : 2.00
## Mean : 79.75 Mean :12.23 Mean : 2.941 Mean : 2.84
## 3rd Qu.: 93.00 3rd Qu.:13.00 3rd Qu.: 4.000 3rd Qu.: 3.00
## Max. :100.00 Max. :94.00 Max. :22.000 Max. :61.00
## NA's :1003 NA's :1003 NA's :1003 NA's :1003
## percap_incm med_rent med_age
## Min. : 864 Min. : 213.0 Min. : 6.80
## 1st Qu.:23951 1st Qu.: 550.0 1st Qu.:36.30
## Median :30291 Median : 643.0 Median :39.50
## Mean :32178 Mean : 668.5 Mean :39.25
## 3rd Qu.:38854 3rd Qu.: 749.0 3rd Qu.:43.10
## Max. :67742 Max. :1475.0 Max. :83.50
## NA's :1039 NA's :1761 NA's :1008
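The median contribution is $50 but the mean is about $215, so the distribution is strongly right-skewed. More contributions came from males (79,580) than from females (67,157), and more went to Democrats (91,286) than to Republicans (60,172).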
We have long-tailed data. There are so many contributions made under $1,000 that it’s hard to see any of the outliers.
A logarithmic transformation of the x-axis reveals something of a log-normal distribution with what could be a mean of $100. Even so, we can see that the distribution is heavier below this mean.
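The effect of the log transformation can be sketched on made-up data. The amounts below are synthetic log-normal draws, not the Ohio contributions, and the variable names are only illustrative; the point is that equal-width bins in `log10(amount)` look roughly symmetric while the raw mean sits well above the median.

```r
# Synthetic stand-in for the contribution amounts (NOT the real data):
# log-normal draws whose log10 is centred near 2, i.e. around $100.
set.seed(1)
amount <- round(10^rnorm(5000, mean = 2, sd = 0.6), 2)
amount <- amount[amount > 0]   # log10 is undefined at zero, so drop zeros

# On the raw scale the histogram is dominated by the long right tail ...
raw_hist <- hist(amount, breaks = 50, plot = FALSE)

# ... whereas equal-width bins in log10(amount) look roughly symmetric.
log_hist <- hist(log10(amount), breaks = 50, plot = FALSE)

# Compare the centre on the log scale with the raw mean and median.
centre_log10 <- mean(log10(amount))
c(median = median(amount),
  mean = mean(amount),
  back_transformed_centre = 10^centre_log10)
```

Setting `plot = TRUE` (the default) draws the two histograms. Any zero amounts in real data would need to be handled before taking logs.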
Despite the long-tailed distribution, there is a sizable group of contributors (4,416) who give exactly $2,500.
Looking closer we can see that there appear to be several discrete values in increments of $50 that people are accustomed to contributing.
Under $60, we can see contributions spaced in intervals of $5.
An overwhelming majority of contributions (93.5%) were made in 2012.
We can see a steady increase in contributions leading up to the election in November.
There seems to be a slight increase in the number of contributions made toward the end of the month. There is also a peak about halfway through the month. People might be making contributions immediately after receiving their paychecks.
Although males made noticeably more contributions than females, the gap is not very large: males account for a proportion of about 0.54 of the contributions with a known gender.
Most contributions are made to either Obama or Romney. Using a log scale we can see the other candidates a little better. A sizable number of contributions went to other candidates, but these candidates are mostly Republican. It would probably be better to use party instead of candidate as a predictor.
The proportion of contributions to Democrats vs. Republicans resembles the proportion between Obama and Romney in the previous histogram. This makes sense, as the number of contributions to other candidates is small in comparison to these two. Also, the number of contributions to the Green party is so small (21 contributions) that we might want to exclude it for simplicity.
It’s important to remember that the remaining demographic variables correspond to the contributor’s zipcode and not to the contributor him/herself.
The distribution of the population of the zipcodes in which contributors live is fairly normal, with mean 26,596 and median 25,049.
Most contributors live in areas with a high percentage of white ethnicity and a very low percentage of black, asian, or hispanic ethnicity.
The distribution of per capita income of the zipcodes in which contributors live is fairly normal, with most contributors living in zipcodes with a per capita income of about $20,000 - $40,000. The mean is $32,178 and the median is $30,291.
Again, we see a fairly normal distribution of median rent in the locations in which contributors live. Rent is very cheap (median of $643) as compared to California but per capita income is also lower.
The median age of the zipcodes in which contributors live also resembles a normal distribution, with a median of 39.5.
A choropleth map of total contributions by zipcode shows that there are hotspots of contributions. Because the zipcodes are so small in these hotspots we might assume that they are cities.
With an overlay of city location, we can see that total contributions are higher nearest to cities.
A map of total population per zipcode is definitely similar to the map of total contributions but it doesn’t seem like an exact match. It could be that the higher the population in a zipcode, the more contributions are made. There may also be more contributions made from affluent suburbs with higher per capita income regardless of total population.
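The per-zipcode quantities behind maps like these can be computed with a simple aggregation. The sketch below uses a handful of hypothetical rows rather than the real data, and the column names `total_amount`, `n_contrib`, `mean_amount`, and `amount_per_person` are assumptions introduced for illustration.

```r
# Hypothetical rows (NOT the real data): one row per contribution.
contrib <- data.frame(
  zipcode    = c("43001", "43001", "43002", "43002", "43002"),
  amount     = c(35, 250, 50, 100, 25),
  population = c(2295, 2295, 3000, 3000, 3000),
  stringsAsFactors = FALSE
)

# Total contribution amount per zipcode (population is constant within a zipcode).
by_zip <- aggregate(amount ~ zipcode + population, data = contrib, FUN = sum)
names(by_zip)[names(by_zip) == "amount"] <- "total_amount"

# Number of contributions per zipcode, looked up by zipcode label.
counts <- table(contrib$zipcode)
by_zip$n_contrib <- as.vector(counts[as.character(by_zip$zipcode)])

# Average contribution and total amount per resident.
by_zip$mean_amount       <- by_zip$total_amount / by_zip$n_contrib
by_zip$amount_per_person <- by_zip$total_amount / by_zip$population
by_zip
```

Each of these columns can then be joined to zipcode boundaries and mapped separately, which makes it easier to compare total contributions against population or average amount.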
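There are 151,479 instances of campaign contribution in the dataset with 19 features. From the original data set, 11 features were kept or derived: zipcode, candidate, name, city, state, employer, occupation, amount, date, gender, and party.
Using the zipcode feature, 8 demographic features were added from another data set: population, pcnt_wht, pcnt_blk, pcnt_asn, pcnt_hsp, percap_incm, med_rent, and med_age.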
From the original data set, all but name, amount, and date are factors. None of the factors are ordered. Name is a character, amount is numeric, and date is a date object. The 8 demographic features are all numeric.
Other observations:
The main features of interest in the data set are amount, gender, party, per capita income, and median age. I would like to see whether the latter four are related to contribution amount. Occupation could be of interest; however, it has too many levels (6,846).
I believe that the percent ethnicities, total population, and median rent may be correlated with contribution amount.
I created two new variables, one for the gender of the contributor based upon the first name, and another for the party of the contributor based upon the candidate that received the contribution. I was unable to programmatically determine gender by first name for 4,742 instances (approx. 3% of the data).
A log transformation of contribution amount revealed a roughly log-normal distribution. Despite this, we can see a sizable group of people who donate the maximum individual contribution allowed by law ($2,500 per election for the 2011-2012 cycle). There are also Political Action Committee (PAC) data in the set, which have a larger limit (approx. $5,000). I am unsure of the validity of the outliers beyond this amount because of my limited knowledge of campaign finance law. Several negative amounts had to be corrected to positive, which leads me to believe that there could be further inaccuracies in the data set.
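A minimal sketch of how these two variables could be derived is shown below. It is not the code used for this analysis: the sample rows, the `party_of` lookup, and the `gender_of` first-name table are all hypothetical, and a real first-name table would be far larger.

```r
# Hypothetical rows in the same "LAST, FIRST MIDDLE" layout as the name column.
contrib <- data.frame(
  name      = c("SMITH, MARY", "JONES, JOHN ROBERT", "DOE, J."),
  candidate = c("Obama, Barack", "Romney, Mitt", "Paul, Ron"),
  stringsAsFactors = FALSE
)

# Party lookup keyed by candidate (only three candidates shown here).
party_of <- c("Obama, Barack" = "democrat",
              "Romney, Mitt"  = "republican",
              "Paul, Ron"     = "republican")
contrib$party <- unname(party_of[contrib$candidate])

# First name = first run of non-space characters after the comma.
contrib$first_name <- sub("^[^,]*,\\s*(\\S+).*$", "\\1", contrib$name, perl = TRUE)

# Hypothetical first-name table; names missing from it yield NA.
gender_of <- c(MARY = "female", JOHN = "male")
contrib$gender <- unname(gender_of[contrib$first_name])
contrib
```

Initials such as `J.` fall through the lookup and stay `NA`, which is one way unmatched rows like those described above can arise.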
The data came in a tidy format and did not need to be transformed.
## amount population pcnt_wht pcnt_blk pcnt_asn
## amount 1.00000000 -0.058151631 0.03627490 -0.035720759 0.04039765
## population -0.05815163 1.000000000 -0.06037725 -0.005842785 0.18167597
## pcnt_wht 0.03627490 -0.060377249 1.00000000 -0.970755919 -0.09407256
## pcnt_blk -0.03572076 -0.005842785 -0.97075592 1.000000000 -0.08506698
## pcnt_asn 0.04039765 0.181675973 -0.09407256 -0.085066978 1.00000000
## pcnt_hsp -0.03254065 0.184148197 -0.22500887 0.065098125 0.07400635
## percap_incm 0.16022853 -0.030815036 0.25904630 -0.312223470 0.46629216
## med_rent 0.07274805 0.156892519 0.12470890 -0.198731584 0.51243822
## med_age 0.08008519 -0.173767903 0.26874345 -0.186164441 -0.22395102
## time -0.11039226 0.032613802 0.01492987 -0.019443600 0.01347171
## pcnt_hsp percap_incm med_rent med_age
## amount -0.03254065 0.1602285332 0.07274805 0.080085191
## population 0.18414820 -0.0308150357 0.15689252 -0.173767903
## pcnt_wht -0.22500887 0.2590463011 0.12470890 0.268743448
## pcnt_blk 0.06509812 -0.3122234697 -0.19873158 -0.186164441
## pcnt_asn 0.07400635 0.4662921634 0.51243822 -0.223951022
## pcnt_hsp 1.00000000 -0.1271504849 -0.04696094 -0.223390263
## percap_incm -0.12715048 1.0000000000 0.74808626 0.334139211
## med_rent -0.04696094 0.7480862610 1.00000000 0.168406483
## med_age -0.22339026 0.3341392105 0.16840648 1.000000000
## time 0.00600853 0.0006534885 0.01281548 -0.003242782
## time
## amount -0.1103922576
## population 0.0326138020
## pcnt_wht 0.0149298671
## pcnt_blk -0.0194436000
## pcnt_asn 0.0134717088
## pcnt_hsp 0.0060085300
## percap_incm 0.0006534885
## med_rent 0.0128154786
## med_age -0.0032427822
## time 1.0000000000
None of the numeric variables is strongly correlated with amount, although every coefficient has an absolute value greater than 0.03. Although the number of contributions increases toward the election, the amount of each contribution is negatively correlated with time (-0.11).
The factor variables gender and party were not included in the correlation analysis or pairs plot, so we should take a closer look at these.
Both the mean contribution amount and the interquartile range (IQR) are larger for males than for females.
Both the mean contribution amount and the interquartile range (IQR) are larger for Republicans than for Democrats.
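A correlation table of this kind can be sketched as follows. The data are synthetic (not the Ohio contributions), `time` is assumed to be the date converted to a number of days, and `use = "pairwise.complete.obs"` is one possible way to handle the missing demographic values; the original analysis may have handled them differently.

```r
set.seed(2)
n <- 2000

# Synthetic data (NOT the real contributions): later dates get smaller amounts.
date   <- as.Date("2011-01-28") + sample(0:700, n, replace = TRUE)
time   <- as.numeric(date)                      # days since 1970-01-01
amount <- exp(5 - 0.002 * (time - min(time)) + rnorm(n))

percap_incm <- rnorm(n, mean = 32000, sd = 9000)
percap_incm[sample(n, 20)] <- NA                # mimic missing demographics

num_df  <- data.frame(amount, time, percap_incm)

# Pairwise deletion keeps every row that is complete for a given pair of columns.
cor_mat <- cor(num_df, use = "pairwise.complete.obs")
round(cor_mat["amount", ], 3)
```

In this toy example the amount per contribution falls over time by construction, so the `amount`/`time` entry is negative regardless of how many contributions arrive on each date.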
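Group comparisons like the two above can be reproduced numerically alongside the boxplots. The sketch below uses synthetic amounts (not the real data) in which the `male` group is deliberately drawn from a wider, higher distribution.

```r
set.seed(3)

# Synthetic amounts (NOT the real data), 500 per group.
df <- data.frame(
  gender = rep(c("female", "male"), each = 500),
  amount = c(rlnorm(500, meanlog = 3.8, sdlog = 1.0),
             rlnorm(500, meanlog = 4.2, sdlog = 1.2)),
  stringsAsFactors = FALSE
)

# Mean, median, and interquartile range of amount within each group.
group_stats <- rbind(
  mean   = tapply(df$amount, df$gender, mean),
  median = tapply(df$amount, df$gender, median),
  iqr    = tapply(df$amount, df$gender, IQR)
)
round(group_stats, 1)
```

The same call with `party` in place of `gender` would give the Republican versus Democrat comparison.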
The following are plots of variables of interest by amount with correlation statistics.
## [1] 0.1584399
## [1] -0.1104779
## [1] 0.08065799
## [1] 0.07274805
## [1] -0.05841065
## [1] 0.03966285
## [1] 0.03676803
## [1] -0.03602072
## [1] -0.03267223
Despite some non-zero correlation values here, the relationships are difficult to see when plotted. I think that these relationships are very weak as far as being able to predict contribution amount.
Here we can see that contributions early on tend to be larger and to have a greater range. There are also far fewer contributions from 2011 than from 2012, so this may be a factor.
Neither month nor day seems to correlate much with amount.
In this choropleth map, the locations of the population centers are not as apparent. Average contribution amount doesn’t seem to track city centers as closely as total contributions do.
Here we can see the population centers.
Average contribution amount may more closely resemble per capita income rather than total population. Higher values surround the city centers with something of a buffer. Also, there are some areas away from the city centers with high average contribution amount. These could be affluent rural areas or areas with little data.
Total amount of contribution per zipcode does again seem to correlate with city proximity.
Relatively speaking, no strong relationships were discovered in the numerical correlations. Every variable did, however, have a correlation with amount of absolute value greater than 0.03. Per capita income had the strongest correlation (0.16), followed by time (numerical date), median age, median rent, total population, percent asian, percent white, percent black, and percent hispanic with the weakest (-0.033).
Time, total population, percent black, and percent hispanic were all negatively correlated with amount. For the factor variables gender and party, Republican contributions were on average higher than Democratic ones, and male contributions were higher than female ones.
I did not expect that time and total population would be correlated with contribution amount. It appears that contributions are largest early on, which might make sense if early donors are backing a candidate through a longer campaign. Total population seems a bit arbitrary, as zipcodes are not necessarily zoned for equal area.
The strongest relationship among all the variables was that between percent white and percent black of a contributor’s location, which have a correlation of -0.97. Among the variables of interest, the strongest correlation was between per capita income and amount, as I had suspected.
Contributions to the Democratic party came from a slight female majority whereas contributions to the Republican party came from an overwhelming male majority.
Although the difference is relatively small, males make slightly more contributions than females to the Democratic party at higher amounts. There appears to be no such exception for the Republican party (males make more contributions at all amounts). Also, at higher amounts, Republicans make more contributions than Democrats.
The following plots show relationships between the variables of interest and amount, by gender and by party. Each plot contains both a LOESS and an LM smoothing line.
The positive correlation between contribution amount and per capita income seems to be much more pronounced with Republicans than Democrats.
Both males and Republicans seem to ‘rally’ behind their candidate leading up to election time with an increase in contribution amount as compared to their female or Democrat counterparts.
Although this demographic information does not necessarily reflect the contributor, both Republicans and males show a stronger positive correlation between median age of their locale and personal contribution amount than do their female or Democrat counterparts.
Median rent by party is similar to per capita income by party.
Population by gender or party does not give us much more insight.
The percent ethnicities seem to show erratic trends with LOESS.
Contribution amount per person seems to more closely resemble the map of average contribution amount rather than total contributions.
##
## Calls:
## m1: lm(formula = I(log(amount)) ~ I(percap_incm), data = subset(model_df,
## amount < 2700))
## m2: lm(formula = I(log(amount)) ~ I(percap_incm) + gender, data = subset(model_df,
## amount < 2700))
## m3: lm(formula = I(log(amount)) ~ I(percap_incm) + party, data = subset(model_df,
## amount < 2700))
##
## ===============================================================
## m1 m2 m3
## ---------------------------------------------------------------
## (Intercept) 3.573*** 3.347*** 3.318***
## (0.011) (0.011) (0.010)
## I(percap_incm) 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000)
## gender: male/female 0.438***
## (0.007)
## party: republican/democrat 1.074***
## (0.007)
## ---------------------------------------------------------------
## R-squared 0.028 0.054 0.177
## adj. R-squared 0.028 0.054 0.177
## sigma 1.335 1.317 1.229
## F 4128.020 4108.751 15532.538
## p 0.000 0.000 0.000
## Log-likelihood -247362.113 -245400.794 -235324.957
## Deviance 258224.633 251323.923 218675.314
## AIC 494730.226 490809.588 470657.915
## BIC 494759.876 490849.120 470697.448
## N 144815 144815 144815
## ===============================================================
Our best model only accounts for about 18% of the variation in donation amount.
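The structure of model m3 and the reading of its coefficients can be sketched on synthetic data. The data frame below is simulated, not the real `model_df`, and its effect sizes are chosen only to loosely echo the table. Two points carry over to the real output: the income coefficient prints as 0.000 because income is measured in dollars, so the per-dollar effect is tiny but non-zero; and because the outcome is `log(amount)`, exponentiating a coefficient gives a multiplicative effect.

```r
set.seed(4)
n <- 3000

# Simulated stand-in for model_df (NOT the real contributions).
model_df <- data.frame(
  percap_incm = rnorm(n, mean = 32000, sd = 9000),
  party       = factor(sample(c("democrat", "republican"), n, replace = TRUE))
)
model_df$amount <- exp(3.3 + 1e-5 * model_df$percap_incm +
                       1.07 * (model_df$party == "republican") +
                       rnorm(n, sd = 1.2))

# Same form as m3: log amount on per capita income and party, capped below $2,700.
m3 <- lm(log(amount) ~ percap_incm + party,
         data = subset(model_df, amount < 2700))
r2 <- summary(m3)$r.squared

# Multiplicative effect of party in the simulated fit.
party_ratio <- exp(coef(m3)[["partyrepublican"]])

# Applied to the coefficients reported in the table above:
exp(1.074)   # party: roughly 2.9 times the typical Democratic amount
exp(0.438)   # gender: roughly 1.5 times the typical female amount
```

Rescaling income to thousands of dollars before fitting would make its coefficient readable in the printed table without changing the fit.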
I looked at amount against all of the correlated features, adding gender and/or party as a third variable. In nearly all of the comparisons, gender and party proved to be significant in differentiating contribution amount. Specifically, contribution amount was higher for males than for females, and higher for Republicans than for Democrats. Also, because Republican donors are mostly male, male trends generally mirror Republican trends, and female trends mirror Democratic trends.
I knew that gender and party might be significant factors but I did not know to what extent (these features could not be analyzed in the correlation table). I was somewhat surprised to see that differences in gender and party were universal across all other features with respect to contribution amount. An interesting finding was that Democratic contribution amounts show little increase with increasing per capita income of the contributor’s zipcode, as compared to Republican contribution amounts. Also, looking at contribution amount over time by party showed that, although both parties had larger contribution amounts earlier on, Republicans increased their contribution amounts leading up to the election whereas Democrats did not. The same is true for males compared with females, but the relationship is less pronounced. This sort of last-minute increase in contribution amount reminds me of a type of rally behavior. Whether or not this is effective in carrying a candidate to victory is another question, but I doubt it (especially since Romney lost Ohio in 2012).
Another interesting difference between males and females is that as the median age of the contributor’s zipcode increases, male contribution amount tends to increase whereas female contribution amount shows a slight decrease. The same idea applies to Republicans and Democrats: Republican amounts increase as the median age of the contributor’s zipcode increases, while Democratic amounts hold steady. If the median age of the contributor’s zipcode did in fact reflect the actual age of the contributor, we could hypothesize that females are less inclined to donate large amounts as they get older.
I experimented with several different models and found that one which modeled the log of amount with per capita income, gender, and party was the most effective in explaining variance in contribution amount. I tried to include time as a predictor because, after per capita income, it had the strongest correlation with amount (-0.11); however, adding this feature only decreased the R-squared value. Adding all of the remaining correlated features had the same effect of decreasing the R-squared value.
The first plot shows both how contribution amounts are distributed log-normally and how contributions increase leading up to an election.
This plot grid shows that, in Ohio, most Democratic contributions were made by females and an overwhelming majority of Republican contributions were made by males. It also shows that, in general, Republican and male contribution amounts are higher than Democrat and female contribution amounts. An interesting rally phenomenon can be seen here for the Republican party as the election date approaches.
This final plot shows that, despite total contribution amount being highly correlated with city center, average contribution amount is highest surrounding a city and also in some rural areas. This may shed light on why presidential candidates spend a significant amount of time campaigning in suburbs and seemingly rural areas.
In my investigation of 2012 Presidential Campaign Contributions for the state of Ohio, I chose to focus on finding the most significant features of a contributor’s information that could be used to predict the actual contribution amount. The most significant features proved to be per capita income of the area in which the contributor lives (which we can assume gives some idea of the contributor him or herself), the gender of the contributor, and the political party affiliation of the contributor (simplified to be either Democrat or Republican). Per capita income of the contributor’s zipcode has a positive correlation with the contributor’s contribution amount. Males have on average higher contribution amounts than females, as do Republicans compared with Democrats. This would lead one to conclude that, given these correlated features, the highest contribution amount could belong to a male Republican who lives in an area with high per capita income. The lowest contribution amount might belong to a contributor who is a female Democrat and who lives in an area with a low per capita income.
Despite these findings, the model that was developed was only able to account for about 18% of the variation in contribution amount. It would have been great to have the actual income, age, and ethnicity of the contributor him/herself. I believe that these would have had a much higher correlation than the demographic information of the contributor’s zipcode. The demographic information was at best a rough approximation of the contributor.
With regard to the choropleth maps, it seemed apparent that total contributions and total contribution amount were highly correlated with city proximity. Average contribution appeared to be higher closer to cities, but with a buffer between the actual city center and high average contribution amounts.
There are several shortcomings of the data set. First, I question the validity of some of the information, as several contribution amounts had to be changed from negative to positive. Second, some zipcodes had far less data than others. This may have skewed the average contribution choropleth map. Another shortcoming was the inability to programmatically determine the gender of the contributor by first name for about 3% of the data. I had a difficult time with regex in R; were I more adept at this, that data might have been included.
If possible, further analysis could include distance, or some measure of proximity, to a city center. The choropleth maps that were generated attempted to show a spatial relationship between amount and cities. This was, however, at best an approximation without any concrete measurements to back up the claims/insights. To do this, an average latitude and longitude value could be calculated and added as a variable for each zipcode, and another variable could be added for distance to the closest city.
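One way the distance-to-nearest-city variable could be built is with the haversine great-circle formula. Everything in the sketch below is an assumption for illustration: the city coordinates are approximate, the two zipcode centroids are made up, and `haversine_km` is a helper defined here rather than a function from any package.

```r
# Great-circle distance in km between points given in decimal degrees.
haversine_km <- function(lat1, lon1, lat2, lon2, radius_km = 6371) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
       cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius_km * asin(pmin(1, sqrt(a)))
}

# Approximate city coordinates (assumed values, for illustration only).
cities <- data.frame(
  city = c("Columbus", "Cleveland", "Cincinnati"),
  lat  = c(39.96, 41.50, 39.10),
  lon  = c(-83.00, -81.69, -84.51),
  stringsAsFactors = FALSE
)

# Hypothetical zipcode centroids; real ones would be averaged per zipcode.
zips <- data.frame(
  zipcode = c("A", "B"),
  lat     = c(40.10, 41.40),
  lon     = c(-82.90, -81.80),
  stringsAsFactors = FALSE
)

# Distance from every zipcode (rows) to every city (columns).
d <- outer(seq_len(nrow(zips)), seq_len(nrow(cities)),
           function(i, j) haversine_km(zips$lat[i], zips$lon[i],
                                       cities$lat[j], cities$lon[j]))

zips$nearest_city <- cities$city[apply(d, 1, which.min)]
zips$dist_km      <- apply(d, 1, min)
zips
```

The resulting `dist_km` column could then be added to the correlation table or the linear model as a concrete proximity measure in place of visual inspection of the maps.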